A new imputation method for small software project data sets

Authors

  • Qinbao Song
  • Martin J. Shepperd
Abstract

Effort prediction is an important issue for software project management. Historical project data sets are frequently used to support such prediction, but these data sets often contain missing data, which makes prediction more difficult. One common practice is to ignore the cases with missing data, but this makes the originally small software project database even smaller and can further decrease the accuracy of prediction. The alternative is missing data imputation. There are many imputation methods, but software data sets are frequently characterised by their small size, and sophisticated imputation methods unfortunately prefer larger data sets. For this reason we explore using simple methods to impute missing data in small project effort data sets. We propose a class mean imputation (CMI) based k-NN hot deck imputation method (MINI) to impute both continuous and nominal missing data in small data sets. We use an incremental approach to increase the variance of the population. To evaluate MINI (with the k-NN and CMI methods as benchmarks) we use data sets with 50 cases and 100 cases sampled from a larger industrial data set, with 10%, 15%, 20% and 30% missing data respectively. We also simulate the Missing Completely at Random (MCAR) and Missing at Random (MAR) missingness mechanisms. The results suggest that the MINI method outperforms both the CMI and the k-NN methods. We conclude that this new imputation technique can be used to impute missing values in small data sets. © 2006 Elsevier Inc. All rights reserved.
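The abstract names two benchmark techniques, class mean imputation and k-NN imputation, and an MCAR missingness mechanism, but gives no algorithmic detail. The following is only a minimal sketch of those generic ideas, not the authors' MINI method. The function names, the record layout (`type`, `size`, `effort`), the use of Euclidean distance, and the choice to average the k nearest donors (rather than copy a single donor value, as a strict hot deck would) are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def inject_mcar(rows, column, fraction, seed=0):
    """Copy rows and blank out `fraction` of `column`, chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    out = [dict(r) for r in rows]
    n_missing = round(fraction * len(out))
    for i in rng.sample(range(len(out)), n_missing):
        out[i][column] = None
    return out


def class_mean_impute(rows, column, class_key):
    """Fill each missing value with the mean of the observed values in the same class."""
    observed = {}
    for r in rows:
        if r[column] is not None:
            observed.setdefault(r[class_key], []).append(r[column])
    # Fallback for a class with no observed values (an assumption of this sketch).
    overall = mean(v for vals in observed.values() for v in vals)
    out = []
    for r in rows:
        r = dict(r)
        if r[column] is None:
            vals = observed.get(r[class_key])
            r[column] = mean(vals) if vals else overall
        out.append(r)
    return out


def knn_impute(rows, column, features, k=2):
    """Fill each missing value with the mean over the k nearest complete cases."""
    donors = [r for r in rows if r[column] is not None]

    def distance(a, b):
        return math.dist([a[f] for f in features], [b[f] for f in features])

    out = []
    for r in rows:
        r = dict(r)
        if r[column] is None:
            nearest = sorted(donors, key=lambda d: distance(r, d))[:k]
            r[column] = mean(d[column] for d in nearest)
        out.append(r)
    return out


# Hypothetical toy project data; the values are invented for illustration only.
projects = [
    {"type": "A", "size": 10, "effort": 100},
    {"type": "A", "size": 12, "effort": 120},
    {"type": "B", "size": 50, "effort": 500},
    {"type": "B", "size": 55, "effort": None},
    {"type": "A", "size": 11, "effort": None},
]

by_class = class_mean_impute(projects, "effort", "type")
by_knn = knn_impute(projects, "effort", ["size"], k=2)
```

On this toy data the two approaches disagree for the large type-B project: the class mean uses only the single observed type-B case, while the k-NN variant with k=2 also draws on a small type-A project. This illustrates why the choice of donors matters when the data set is small.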

Similar resources

A Short Note on Using Multiple Imputation Techniques for Very Small Data Sets

This short note describes a simple experiment to investigate the value of using multiple imputation (MI) methods [2, 3]. We are particularly interested in whether a simple bootstrap based on a k-nearest neighbour (kNN) method can help address the problem of missing values in two very small, but typical, software project data sets. This is an important question because, unfortunately, many real-...

Dealing with Missing Software Project Data

Whilst there is a general consensus that quantitative approaches are an important adjunct to successful software project management there has been relatively little research into many of the obstacles to data collection and analysis in the real world. One feature that characterises many of the data sets we deal with is missing or highly questionable values. Naturally this problem is not unique ...

Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see if it is bett...

A New Uncertain Modeling of Production Project Time and Cost Based on Atanassov Fuzzy Sets

Uncertainty plays a major role in any project evaluation and management process. One of the trickiest parts of any production project work is its cost and time forecasting. Since in the initial phases of production projects uncertainty is at its highest level, a reliable method of project scheduling and cash flow generation is vital to help the managers reach successful implementation of the ...

A Multi-Criteria Analysis Model under an Interval Type-2 Fuzzy Environment with an Application to Production Project Decision Problems

Using Multi-Criteria Decision-Making (MCDM) to solve complicated decisions often includes uncertainty, which could be tackled by utilizing the fuzzy sets theory. Type-2 fuzzy sets consider more uncertainty than type-1 fuzzy sets. These fuzzy sets provide more degrees of freedom to illustrate the uncertainty and fuzziness in real-world production projects. In this paper, a new multi-criteria ana...

Journal title:
  • Journal of Systems and Software

Volume 80  Issue 

Pages  -

Publication date 2007